RaiskaaHTML

This is the manual for RaiskaaHTML, a tool to strip junk from HTML documents. This document describes version 1.1, released 20.10.98.

Introduction
Requirements
Installation
Usage from CLI
Dealing with Errors
Usage in Directory Opus
Usage in Your Web Browser
Command Line Options
Copyright
Support and Updates
Future
History

Introduction

As the technology behind the Internet and the WWW is an incredible piece of crap, it is quite painful for the user to access these resources online.

Nevertheless, it offers a basically easy-to-use and cheap way to obtain information that would be difficult or expensive to get elsewhere. Therefore the only reasonable application of the WWW is to download everything of use to one's hard disk. This can be done efficiently with recursive download tools like wget. To make their usage more painless, browser frontends like WgetRexx exist.

Still the problem remains that these documents often use dreadful colors and layout, not only looking ugly but also wasting space. Although most browsers offer the option to ignore the document provided colors, this usually affects all documents.

This is where RaiskaaHTML comes in: it can remove all sorts of junk from HTML documents on your local storage, including annoying colors, invisible meta tags, useless scripts, bloated comments and redundant white space.

Requirements

To successfully run RaiskaaHTML, you need at least AmigaOS 2.04 with ARexx installed and running.

Furthermore rexxdossupport.library has to be installed. RexxDosSupport is copyright by Hartmut Goebel and can be downloaded from aminet:util/rexx/rexxdossupport.lha.

For fully understanding all possible command line options, some understanding of HTML is required. For using only the basic functions, this is hopefully not necessary.

As long as the HTML code you process is correct, you won't get any syntax error messages. RaiskaaHTML is quite tolerant and can cope with most HTML code, even if it contains some minor errors - it does not have to be completely SGML conformant. With serious errors however it refuses to convert the document. You are then supposed to fix the problem and retry. For that you will again need some understanding of HTML.

If you want error messages to show up in the ScMsg message browser instead of the console, sc:c/scmsg, which is part of the SAS/c compiler package, has to be installed. This is completely optional though.

Installation

There are two files you have to take care of:

RaiskaaHTML.rexx - An ARexx script dealing with command line arguments using the usual syntax like most other commands under AmigaOS. It starts the program described below and also deals with error handling. In general this is the one you invoke.
TodellakinRaiskaaHTML - An unusable command line filter program not further documented here. It is coded in C and therefore reasonably fast. RaiskaaHTML.rexx invokes it and therefore has to find it in the Workbench search path.

To install all this, copy TodellakinRaiskaaHTML to some directory within your Workbench search path and store RaiskaaHTML.rexx wherever you put your supporting scripts. You can use the following commands in CLI:

cd where ever you extracted the archive/RaiskaaHTML
copy RaiskaaHTML.rexx TO rexx:
copy TodellakinRaiskaaHTML TO c:

Preferably you should also store this manual in some place where you can find it again, for instance in Help:RaiskaaHTML.

Usage from CLI

For a quick example, check example/fancy.html coming with this archive: it contains a dreadful looking web page as they are often done by insane web authors. Now enter in CLI:

rx RaiskaaHTML.rexx example/fancy.html to ram:fancy.html blink color

Load the resulting ram:fancy.html in your browser and notice the difference. Isn't it a relief?

If you prefer to use RaiskaaHTML from CLI, you might want to add the following line to your s:Shell-Startup:

alias raiskaa rx RaiskaaHTML.rexx blink color []

Feel free to replace "blink color" with whatever options suit you.

Command line options are discussed in more detail later. Here is a short explanation of some switches you can specify to influence the output document:

Blink - Remove blinking text
Color - Remove document background and text colors
Font - Remove font faces (but not sizes)
Link, Meta - Remove some usually unneeded information
Space - Remove redundant blanks and linefeeds

The options below might cause problems in certain documents. Use them with care.

DocType - Remove document type declaration
Script - Remove JavaScript and other HTML inline scripts
SGML - Remove SGML comments

Furthermore, it is possible to specify a directory as From and to overwrite documents by not specifying To.

As no reasonably thinking human being remembers command line options, you might also be interested in learning how to integrate RaiskaaHTML into Directory Opus and your web browser. But first there is one thing to discuss that is easier to explain in CLI:

Dealing with Errors

As a matter of fact, many HTML documents contain syntax errors.

RaiskaaHTML does not parse the HTML code very exactly, but still requires the basic tag/attribute structure to be intact. If this is not the case, it displays an error message briefly describing the problem. There are two classes of messages: warnings and errors.

A warning points out a minor problem that can be worked around. For example, if the document contains a ">" sign that is not used to close a tag, it is replaced by "&gt;" in the output. Still, the actual problem might have occurred earlier, and the output might not look the way you want it to.
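To illustrate the kind of workaround meant here, the following is a minimal Python sketch. It is not RaiskaaHTML's actual C code, and the function name escape_stray_gt is made up for this example. It walks through the text, keeps track of whether it is currently inside a tag, and writes "&gt;" for every ">" that does not close one. It deliberately ignores comments, quoted attribute values and <pre> sections.

```python
def escape_stray_gt(html):
    """Replace every ">" that does not close a tag by "&gt;".

    Hypothetical helper for illustration only; it ignores comments
    and quoted attribute values.
    """
    out = []
    in_tag = False
    for ch in html:
        if ch == "<":
            # A "<" opens a tag.
            in_tag = True
            out.append(ch)
        elif ch == ">":
            if in_tag:
                # This ">" closes the tag opened earlier.
                in_tag = False
                out.append(ch)
            else:
                # Unmatched ">": write the entity instead.
                out.append("&gt;")
        else:
            out.append(ch)
    return "".join(out)
```

Called with the text `<p>a > b</p>`, the sketch leaves both tags alone and only rewrites the ">" between "a" and "b".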

An error, however, is too difficult for RaiskaaHTML to fix. Therefore you have to fix it manually.

You can find an example document containing a minor problem in example/warning.html. Let's see what happens if you type the following in CLI:

rx RaiskaaHTML.rexx example/warning.html to ram:test.html

This results in the following output:

example/warning.html:   7, 20: warning: unmatched ">"

Even though the document was not correct, ram:test.html could be created.

If you are wondering what those two numbers mean: 7 denotes the line and 20 the column in the input file where the problem was encountered. This information comes in handy if the document contains serious errors, like the included example/error.html. Try this in CLI:

rx RaiskaaHTML.rexx example/error.html to ram:test.html

This time, no output is written and the following message is displayed:

example/error.html:  12, 27: error: "*" is not an HTML attribute

If you take a look at example/error.html, you should notice that the "<" in line 12 has to be changed to "&lt;". Load the document into your editor and fix the problem. After that you can start RaiskaaHTML again and it should accept the data without any further whining.
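If you want to process such messages in a script of your own, the format shown in the two examples (file name, line, column, class, text) is easy to pick apart. The following Python sketch assumes that every message follows exactly that layout. The function name parse_message and the dictionary keys are invented for this example and are not part of RaiskaaHTML.

```python
import re

# Layout assumed from the example messages:
#   <file>: <line>, <column>: <warning|error>: <text>
MESSAGE = re.compile(
    r"^(?P<file>.+?):\s*(?P<line>\d+),\s*(?P<column>\d+):\s*"
    r"(?P<kind>warning|error):\s*(?P<text>.*)$"
)


def parse_message(message):
    """Split one message into its parts, or return None if it does not match."""
    match = MESSAGE.match(message)
    if match is None:
        return None
    parts = match.groupdict()
    # Line and column are more useful as numbers.
    parts["line"] = int(parts["line"])
    parts["column"] = int(parts["column"])
    return parts
```

The file name is matched non-greedily, so names containing a colon, such as ram:test.html, should still come out in one piece.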

Maybe you think loading the documents and moving to the requested line is a cumbersome task. Not if you have the SAS/c message browser installed. In this case, repeat the last example with ScMsg enabled:

rx RaiskaaHTML.rexx example/error.html to ram:test.html ScMsg

Then click in the browser and wait for your editor to load the document and jump at least to the proper line. (Unfortunately the message browser is too stupid to deal with columns.)

As the purpose of RaiskaaHTML is not to help you fix errors in HTML documents, the error handling mechanisms are not very sophisticated and the resulting messages might not be of much help to you. In such a case, use a real syntax checker before trying to run RaiskaaHTML again.

Also note that the fact that a document was accepted by RaiskaaHTML does not tell much about its correctness because the program hardly cares about anything. It only tests for certain tags and attributes to be stripped and looks for the <pre> tag and SGML comments.

Usage in Directory Opus

Of course you can assign RaiskaaHTML to a menu, button or whatever in Directory Opus. When the function editor pops up, specify the following command:

Function editor

Type	Command
ARexx	RaiskaaHTML.rexx {o} blink color

Flags: CD source, Do all files, Output to window, Rescan source, Window close button

Feel free to replace "blink color" with whatever options suit you.

This way you can select multiple files and whole directories containing HTML documents. When started, a window opens where RaiskaaHTML displays its progress status.

Usage in Your Web Browser

If you have a browser that allows you to create ARexx macros, you can assign RaiskaaHTML to a button or menu.

RaiskaaHTML also accepts the URI format for the From parameter. Naturally it can only deal with URIs of type file://localhost/ because it does not have any network functions implemented. This should not be a problem as you do not have write access to documents written by other people anyway, so the document written by RaiskaaHTML has to end up on your hard disk in any case.
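As a rough illustration of what accepting such a URI involves, here is a small Python sketch. It is an assumption about the general idea, not the code of RaiskaaHTML.rexx: the function name uri_to_local_name is made up, and how a particular browser writes Amiga paths into a URI may differ from the examples used here.

```python
from urllib.parse import unquote

# The only URI type the manual describes as supported.
PREFIX = "file://localhost/"


def uri_to_local_name(uri):
    """Turn a file://localhost/ URI into a local file name.

    Hypothetical helper: strips the prefix and decodes %xx escapes.
    Any other kind of URI is rejected.
    """
    if not uri.lower().startswith(PREFIX):
        raise ValueError("only file://localhost/ URIs are supported: " + uri)
    return unquote(uri[len(PREFIX):])
```

With this sketch, file://localhost/ram:fancy.html turns into ram:fancy.html, while an http:// URI raises an error.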

Refer to the manual of your browser to find out how to pass the URI of the document currently being browsed as a command line parameter to scripts. In general, it should look something like this:

Settings/ARexx
Macro	Command
Raiskaa!	rx RaiskaaHTML.rexx %u blink color

Feel free to replace "blink color" with whatever options suit you.

This way you can process only one document at a time.

If your browser does not open a console automatically, you have to redirect the output manually, e.g. by appending ">con:////RaiskaaHTML/CLOSE/WAIT" to the above command.

Command Line Options

Template

rx RaiskaaHTML.rexx From/A, To/K, OnError/K, Blink/S, Color/S, DocType/S, Font/S, Heading/S, Ignore/S, Linefeed/S, Link/S, Meta/S, Quiet/S, Ruler/S, ScMsg/S, Script/S, SGML/S, Space/S, Table/S, Toggle/S

Basic Options

From

The From option specifies the HTML document you want to convert and is required.

If you specify a directory instead of a file, the directory and all its subdirectories are scanned for files with the suffixes .html and .htm. This is useful if you want to convert a whole downloaded site with the same set of options.
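The scan described here can be pictured with a short Python sketch. It only illustrates the idea and is not how RaiskaaHTML.rexx is written. The function name find_html_files is invented, and matching the suffix without regard to upper and lower case is an assumption.

```python
import os


def find_html_files(top):
    """Collect all .html and .htm files below the directory top.

    Illustrative sketch; the case-insensitive suffix match is an
    assumption, not documented behaviour.
    """
    found = []
    # os.walk visits top and every subdirectory beneath it.
    for folder, _subdirs, names in os.walk(top):
        for name in names:
            if name.lower().endswith((".html", ".htm")):
                found.append(os.path.join(folder, name))
    return sorted(found)
```

Each name in the returned list could then be handed to the converter with the same set of options.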

Quiet

With the Quiet switch enabled, the program does not display a report at the end of execution. In that case, you won't know how many bytes have been removed and how many percent of storage space you saved.
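The two figures mentioned here come down to simple arithmetic on the file sizes before and after conversion. The Python sketch below shows that calculation. The function name savings is made up, and the exact wording and rounding of the real report are not documented here.

```python
def savings(original_size, stripped_size):
    """Return (bytes removed, percent saved) for one document.

    Sizes are given in bytes. An empty original counts as 0 percent
    saved to avoid dividing by zero.
    """
    removed = original_size - stripped_size
    if original_size == 0:
        return removed, 0.0
    return removed, 100.0 * removed / original_size
```

For example, a document shrinking from 2000 to 1500 bytes has lost 500 bytes, which is a quarter of its original size.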

To

The To option specifies an optional target document where the converted data should end up. If you do not give one, the original document is overwritten. This means that all data RaiskaaHTML removed is lost.

Toggle

With the Toggle switch enabled, all HTML options described below are toggled. That means all such switches are enabled except those you specified.

If you specify other switches together with Toggle, they are disabled. For example, Toggle DocType removes everything possible but preserves the document type declaration.
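In other words, Toggle turns the set of switches you typed into its complement. The following Python sketch shows that logic. Which switches count as HTML options is an assumption here: the set below is taken from the template, leaving out the options documented under Basic Options and Error Handling Options. The function name effective_switches is made up.

```python
# Assumed set of HTML switches, derived from the command template.
HTML_SWITCHES = {
    "blink", "color", "doctype", "font", "heading", "linefeed", "link",
    "meta", "ruler", "script", "sgml", "space", "table",
}


def effective_switches(given, toggle=False):
    """Return the HTML switches that end up enabled.

    given holds the switches typed on the command line; names that
    are not HTML switches are ignored by this sketch.
    """
    chosen = {name.lower() for name in given} & HTML_SWITCHES
    if toggle:
        # Toggle: everything is enabled except what was specified.
        return HTML_SWITCHES - chosen
    return chosen
```

Under these assumptions, effective_switches(["DocType"], toggle=True) enables every switch in the set except doctype.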

HTML Options

Blink

With the Blink switch enabled, all <blink> tags are removed. This can make your document a lot more pleasant to read.

Color

With the Color switch enabled, all color information specified with <body> and <font> is removed.

DocType

With the DocType switch enabled, a possible document type declaration in the first line of the document is removed. This saves some space, but you won't be able to use any SGML parser on the document.

So you better enable this only if you really know what you are doing.

Error Handling Options

OnError

With the OnError option you can specify what to do about documents containing faulty HTML code. Possible values are (upper/lower case does not matter):

Abort: Do not process any further documents and exit.
Ask: Pop up a requester for every faulty document and let you decide what to do about it. The choices in the requester represent all other possible values.
Retry: Wait until you fixed and saved the document, then try to process the same document once more. If it fails again, keep trying like before.
Skip: Skip the faulty document and continue with the next one. No changes are made to the document.

The default value is Ask.
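How these values could steer a batch run is sketched below in Python. This is an illustration under assumptions, not the logic of RaiskaaHTML.rexx. The names process_documents, convert and ask are invented, and unlike the real Retry the sketch does not wait for you to save the document before trying again.

```python
def process_documents(names, convert, on_error="ask", ask=None):
    """Run convert(name) on each document under an OnError-like policy.

    convert returns True on success and False for a faulty document.
    ask(name) is only used for "ask" and must return "abort", "retry"
    or "skip". Returns the list of documents that were converted.
    """
    policy = on_error.lower()  # upper/lower case does not matter
    if policy not in ("abort", "ask", "retry", "skip"):
        raise ValueError("unknown OnError value: " + on_error)
    converted = []
    for name in names:
        while not convert(name):
            action = ask(name) if policy == "ask" else policy
            if action == "abort":
                # Do not process any further documents.
                return converted
            if action == "skip":
                # Leave this document alone, go on with the next one.
                break
            # "retry": fall through and try the same document again.
        else:
            # The loop ended without break, so convert succeeded.
            converted.append(name)
    return converted
```

With "skip", a faulty document in the middle of the list is left out and the remaining ones are still converted. With "abort", processing stops at the first faulty document.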

Ignore

With the Ignore switch enabled, warnings about faulty HTML code are ignored and the document is still written.

Use this switch with caution when processing directories as there is no way to restore the original document once it has been overwritten.

ScMsg

With the ScMsg switch enabled, parser-related error messages are no longer displayed in the console but sent to the ScMsg message browser.

ScMsg is part of the SAS/c compiler package. Naturally it has to be installed before using this switch. If the message browser is not already running, the script will start it automatically (expecting it to be in sc:c/ScMsg).

Copyright

RaiskaaHTML is freeware. You can use it without having to pay and you can freely redistribute it as long as all files coming with the archive are preserved and no files are added or removed.

You use this material at your own risk. No responsibilities are taken for trashed HTML documents, damaged Amigas or any other components or data involved while using RaiskaaHTML.

Support and Updates

New versions of RaiskaaHTML are uploaded to aminet:comm/www/RaiskaaHTML.lha. Check aminet:comm/www/RaiskaaHTML.readme to find out if there is one.

Suggestions are not really welcome. This tool does what I want it to do, so extending it hardly makes sense. See also Future.

Bug reports can be sent to Thomas Aglassinger <agi@sbox.tu-graz.ac.at>.

Future

Basically I don't want to change much about it. That means I'm not thinking about adding a configuration file where the user can specify which attributes to strip from which tag or such things. Although this would be more flexible, it would mostly make the program more difficult to use.

If a <font face=..> is reduced to <font>, the whole tag could be stripped. This would require implementing some kind of stack, which is pretty dull to do.

One thing that would be interesting is to add a GUI, preferably by using MUIRexx. Not that I plan to do it, but if somebody else improves the ARexx script, I will happily include these changes. This would make the script much more efficient to use from the browser or DOpus as one could then specify a static set of options and change them on the fly depending on the files to process.

History

Version 1.1, 5-Jan-1999

Fixed bug: options Meta and Link were always enabled independent of the CLI arguments
Fixed problem: in documents created on MS-DOS based systems, the CR/LF pair used to denote an end of line was stripped to CR. When displayed with buggy browsers like VNG, this could lead to words at the end of a line being glued together with the first word from the next line.
Added feature not to write documents if warnings occurred
Added CLI option Ignore to ignore warnings
Added CLI option OnError to let user decide what to do about faulty documents
Added CLI option Linefeed to get rid of CR/LF mess
Added feature to strip color and face from <basefont> if Color resp. Font enabled
Added picture with excerpt from "Suomi-Englanti sanakirja"

Version 1.0, 24-Sep-1998

Initial release
